MSStyleTTS: Multi-Scale Style Modeling with Hierarchical Context Information for Expressive Speech Synthesis
Abstract
Expressive speech synthesis is crucial for many human-computer interaction scenarios, such as audiobooks, podcasts, and voice assistants. Previous works focus on predicting style embeddings at a single scale from information within the current sentence. However, the context information in neighboring sentences and the multi-scale nature of style in human speech are neglected, making it challenging to convert multi-sentence text into natural and expressive speech. In this paper, we propose MSStyleTTS, a style modeling method for expressive speech synthesis, to capture and predict styles at different levels from a wider range of context rather than a single sentence. Two sub-modules, a multi-scale style extractor and a multi-scale style predictor, are trained together with a FastSpeech 2 based acoustic model. The predictor is designed to explore the hierarchical context information by considering structural relationships in the context, and to predict style embeddings at the global level, sentence level, and subword level. The extractor extracts multi-scale style embeddings from the ground-truth speech and explicitly guides the style prediction. Evaluations on both in-domain and out-of-domain audiobook datasets demonstrate that the proposed method significantly outperforms three baselines. In addition, we conduct an analysis of the context information and the multi-scale style representations, which have never been discussed before.
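The abstract does not say how the three scales of style embedding are merged before they reach the acoustic model. As a rough illustration only, the sketch below assumes the simplest option: the global-level and sentence-level embeddings are summed, added to each subword-level embedding, and the result is repeated over the phonemes that each subword covers. The function name `combine_styles`, the summation, and the `subword_lengths` input are all hypothetical and are not taken from the paper.

```python
from typing import List

Vector = List[float]


def add(a: Vector, b: Vector) -> Vector:
    """Element-wise sum of two equally sized vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return [x + y for x, y in zip(a, b)]


def combine_styles(
    global_style: Vector,
    sentence_style: Vector,
    subword_styles: List[Vector],
    subword_lengths: List[int],
) -> List[Vector]:
    """Expand three scales of style into one vector per phoneme.

    Assumption (not from the paper): scales are merged by summation.
    subword_lengths[i] is the number of phonemes covered by subword i.
    """
    if len(subword_styles) != len(subword_lengths):
        raise ValueError("need one length per subword style")

    # The coarse style is shared by every phoneme in the sentence.
    coarse = add(global_style, sentence_style)

    phoneme_styles: List[Vector] = []
    for style, n_phonemes in zip(subword_styles, subword_lengths):
        fused = add(coarse, style)
        # Repeat the fused style once per phoneme of this subword.
        phoneme_styles.extend(list(fused) for _ in range(n_phonemes))
    return phoneme_styles


if __name__ == "__main__":
    result = combine_styles(
        global_style=[1.0, 0.0],
        sentence_style=[0.0, 1.0],
        subword_styles=[[0.5, 0.5], [2.0, 2.0]],
        subword_lengths=[2, 1],
    )
    print(result)
```

In a FastSpeech 2 style model, a per-phoneme sequence like this could then be added to the phoneme encoder output before the variance adaptor, though the actual fusion used by MSStyleTTS may differ.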
Similar resources
Expressive Speech Synthesis and Modeling
As human beings we communicate with each other through our feelings, which are expressions shaped by the experience and knowledge we have. Since every single state of humans can be related to a particular emotion, the role of emotions in communication cannot be underestimated. There have been studies of the human brain showing the impossibility to make appropriate decisions when the emotion-con...
Uncovering Latent Style Factors for Expressive Speech Synthesis
Prosodic modeling is a core problem in speech synthesis. The key challenge is producing desirable prosody from textual input containing only phonetic information. In this preliminary study, we introduce the concept of “style tokens” in Tacotron, a recently proposed end-to-end neural speech synthesis model. Using style tokens, we aim to extract independent prosodic styles from training data. We ...
Hierarchical stress generation with Fujisaki model in expressive speech synthesis
This paper introduces a hierarchical stress generation for expressive speech synthesis. In the previous study, we proposed a novel hierarchical Mandarin stress modeling method, and the text-based stress prediction experiments demonstrate that a reliable stress assignment can be obtained from textual features. However, the stress model should be further verified to be an effective and efficient pros...
Modeling the prosody of Vietnamese attitudes for expressive speech synthesis
Attitudes or social affects are strongly implied in interaction processing, and specifically to socio-cultural aspects of language. This paper presents the modeling of attitude to apply in expressive speech synthesis in Vietnamese, an under-resourced tonal language. A prosodic model for Vietnamese attitude is proposed based on the concept of “rendez-vous” between linguistic levels and prosodic ...
Multi-level Exemplar-Based Duration Generation for Expressive Speech Synthesis
The generation of duration of speech units from linguistic information, as one component of a prosody model, is considered to be a requirement for natural sounding speech synthesis. This paper investigates the use of a multi-level exemplar-based model for duration generation for the purposes of expressive speech synthesis. The multi-level exemplar-based model has been proposed in the literature...
Journal
Journal title: IEEE/ACM Transactions on Audio, Speech, and Language Processing
Year: 2023
ISSN: 2329-9304, 2329-9290
DOI: https://doi.org/10.1109/taslp.2023.3301217